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Abstract 


Head-related transfer functions (HRTFs) describe the spatial filtering of 
acoustic signals by a listener's anatomy. With the increase of computational power, 
HRTFs are nowadays more and more used for the spatialised headphone playback 
of 3D sounds, thus enabling personalised binaural audio playback. HRTFs are tra- 
ditionally measured acoustically and various measurement systems have been set 
up worldwide. Despite the trend to develop more user-friendly systems and as an 
alternative to the most expensive and rather elaborate measurements, HRTFs can 
also be numerically calculated, provided an accurate representation of the 3D 
geometry of head and ears exists. While under optimal conditions, it is possible to 
generate said 3D geometries even from 2D photos of a listener, the geometry 
acquisition is still a subject of research. In this chapter, we review the requirements 
and state-of-the-art methods for obtaining personalised HRTFs, focusing on the 
recent advances in numerical HRTF calculation. 


Keywords: head-related transfer functions, spatial hearing, acoustic measurement, 
numerical calculation, localisation 


1. Introduction 


Head-related transfer functions (HRTFs) describe the filtering of the acoustic 
field produced by a sound source arriving at the listener’s ear. The filtering is the 
effect of the interaction of the sound field with the listener’s anatomy and has 
various properties. First, the incoming sound wave arrives at the ipsilateral pinna, 
i.e., the ear closer to the sound source, and then at the contralateral ear, i.e., the ear 
away from the sound source. This time difference between ipsilateral and contra- 
lateral ear is usually described as the interaural time difference (ITD). Second, 
larger anatomical structures, i.e., torso, shoulders and head, affect frequencies up to 
3 kHz in a comparatively trivial way. As the listener’s torso and head shadow the 
sound wave arriving at the contralateral ear, interaural level differences (ILDs) 
arise. Third, the incoming sound is filtered in a complex way by the shape of the 
listener’s pinnae. These monaural time-frequency-filtering effects become espe- 
cially important for higher frequency regions (above approximately 4 kHz) and 
sound directions inducing the same ITDs and ILDs [1-6]. Humans have learned to 
interpret this acoustic filtering to span an auditory space as an internal model of 
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their natural environment [7]. Because the pinna shape is unique for every person, 
HRTFs are considered listener-specific [8-10], similar to a fingerprint [1-6]. With 
an individually fitted HRTF dataset, it is possible for a person to perceive sounds (in 
a virtual environment) via headphones as if the sounds would originate from their 
(physical) position around the listener. 

Both interaural and monaural features for a single sound direction can be 
represented by a binaural HRTF pair [11]. In signal processing terms, a binaural 
HRTF pair can be described as 


x“ _PrEĂ*S>S) 
LIS DOP) 


where p; and pp describe the sound pressure at a position inside the left and 
right ear, respectively (typically the entrance of the left and right ear canal or a 
position close to the eardrum), x* describes the sound-source position (i.e., dis- 
tance and direction), f describes the frequency and s the listener’s geometry, 
emphasising the listener-specificity of HRTFs. pọ describes the reference sound 
pressure, which is usually the pressure measured at the position of the midpoint of 
right and left ear without the head being present. 

There are several options to set a specific coordinate system to systematically 
describe directions for HRTFs. From the physical perspective, the spherical coordinate 
system is a natural choice; in that case, the origin of the system is placed inside the 
listener's head at the midpoint between left and right ear and the direction is 
described by azimuth and elevation angles, see Figure 1a. In this system, one can 
intuitively define the two main planes: The eye-level horizontal plane, i.e., all direc- 
tions with the elevation angle of zero, and the median plane, i.e., all directions with 
the azimuth angle of zero. The eye-level horizontal plane is also called Frankfurt 
plane and can be anatomically defined as the plane connecting the lowest part of the 
listener's orbital cavity and the highest part of the bony ear canal (meatus acusticus 
externus osseus). This spherical coordinate system resembles a geodesic representation 
widely used in physics, with the poles located at the top and bottom. An alternative 
system that is more relevant from the auditory perspective is given by the interaural- 


(1) 


Figure 1. 

Coordinate systems typically used in the HRTF acquisition and representation. The dashed line represents the 
interaural axis, and the arrow represents the viewing direction. (a) Spherical coordinate system with the 
azimuth and elevation angles. (b) Simple interaural-polar coordinate system with the lateral and polar angles 
obtained by rotation the poles of the spherical system. (c) Modified interaural-polar coordinate system with the 
lateral and polar angles corresponding to the azimuth angle in the horizontal plane and the elevation angle in 
the median plane. 
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polar coordinate system. This system is shown in Figure 1b and can be constructed by 
rotating the poles of the spherical system to the interaural axis, i.e., the axis 
connecting the two ears. A sound direction is then described by the lateral angles 
(along the horizontal plane) and polar angles (along the median plane). The poles are 
then located on the left and right sides of the listener. This simple interaural-polar 
coordinate system was used in various psychoacoustic studies, e.g., [12, 13], and has 
the disadvantage that the lateral angle does not correspond to the azimuth angle. 
Figure 1c shows the modified version of the interaural-polar coordinate system, 
which does not have this disadvantage. Here, the sign of the lateral angle is flipped, 
i.e., in the coordinate system, the positive lateral angles are used for sounds located on 
the left side of the listener. This transformation to a left-handed coordinate system 
has the advantage of having the lateral angle corresponding to the azimuth angle for 
all sources placed in the horizontal plane, and the polar angle corresponding to the 
elevation angle for all sources placed in the median plane. Thus, the modified 
interaural-polar coordinate system offers a better link between the psychoacoustic 
research and audio engineering. In that system, the lateral angle ranges from —90° 
(right ear) over 0° (front) to 90° (left ear), and the polar angle ranges from —90° 
(bottom) over 0° (front) and 90° (up) to 180° (back) and 270° (bottom again). 

The understanding of these coordinate systems is important because state-of- 
the-art acquisitions and representations of HRTFs utilise those systems. For exam- 
ple, Figure 2 shows HRTFs along the Frankfurt and the median plane. These 
various coordinate systems are used in HRTF visualisation, in various HRTF-related 
software packages such as the SOFA toolbox [15], and in auditory modelling, e.g., 
the Auditory Modelling Toolbox (AMT) [16, 17]. 

HRTF acquisition can be classified into three categories: acoustic measurement, 
numerical calculation, and personalisation [18]. 

The acoustic measurement is traditionally designed as the measurement of the 
impulse response between source and receiver in an anechoic or semianechoic 
chamber, describing the transmission path from a sound source to the ear [11, 19]. 
A comprehensive review of the established state-of-the-art acoustic techniques to 
measure HRTFs can be found in [20]. Thus, in this chapter, Section 3, we only 
briefly provide an overview of the traditional acoustic HRTF measurement 
approaches, highlight some of their differences and new trends and focus on the 
requirements for the acoustic measurement. 
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Figure 2. 

HRTF magnitude spectra for the listeners (a) NH236 and (b) NH257, both from the ARI database [14]. Top: 
Spectra along the median plane. Bottom: Spectra along the eye-level horizontal plane. o dB corresponds to the 
maximum magnitude in each panel. 
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Numerical HRTF calculation simulates the acoustic measurement by considering 
a 3D representation of the listener’s geometry and the positions of multiple external 
sound sources, for which the generated sound pressure at the entrance of the ear 
canal is calculated. This technique has become more popular and is the main focus 
of this chapter. To this end, in Section 4, we provide an overview of the principles 
of various numerical calculation approaches including a comparison of the men- 
tioned methods. 

Personalisation of HRTFs describes the process of adapting an existing set of 
generic data guided by listener-specific information, either with the help of 
objective or subjective personalisation method. The objective personalisation has 
been approached from two different domains: the geometric domain, in which 
listener-specific anthropometric data are measured and used to personalise a 
generic geometric model from which HRTFs are then simulated; or the 
spectral domain, in which a generic HRTF set is directly personalised based on 
listener-specific information. Examples for personalisation approaches include 
utilising frequency scaling [21], parametric modelling of peaks and notches [22], 
active shape modelling (ASM) [23], principal component analysis (PCA) in 
both geometric [24] and spectral domains [25-29], multiple regression analysis 
[30], independent component analysis (ICA) [31], large deformation 
diffeomorphic metric mapping (LDDMM) [25, 32], local neighbourhood 
mapping [33], neural networks [34-41] and linear combination of HRTFs [42]. 
Despite many efforts worldwide [43-46], the link between the morphology and 
HRTFs is not fully understood yet, mostly because of the high dimensionality of 
the problem. Most recent tools for studying that link are rooted in aligning 
high-resolution pinna representations to target representations facilitated with 
parametric pinna models [47, 48]. 

In the subjective personalisation, listeners are confronted with several sets of 
HRTFs and an algorithm (usually based on the evaluation of localisation errors, i.e., 
the difference between perceived and actual sound-source location) adapts the 
HRTF sets aiming at converging at listener-specific HRTFs [9, 49]. For an educated 
guess for the initial sets, anthropometric data can be used to pre-scale the HRTF 
sets, or the HRTF sets can be pre-selected via psychoacoustic models [50]. Cluster- 
ing of the HRTF sets can further improve the relevance and reduce the duration of 
the personalisation procedure [49, 51]. 

All these methods aim at providing a specific quality in terms of acoustic and 
psychoacoustic properties. In the following section, we describe the acoustic prop- 
erties and psychoacoustic requirements for human HRTFs, both of which lay the 
base for HRTF acquisition. Then, we briefly describe the most important require- 
ments for the acoustic HRTF measurement, complementing the work of Li and 
Peissig [20]. Finally, we describe approaches for numeric HRTF calculation in 
greater detail. 


2. Head-related transfer functions: acoustic properties and 
psychoacoustic requirements 


In this section, we describe the acoustic properties of HRTFs and relate them to 
psychophysical properties of human hearing with the goal to derive the minimum 
requirements for sufficiently accurate HRTF acquisition by means of perception. 
We analyse spectral, temporal and spatial aspects of HRTFs and consider 
contributions of distinct parts of the human body to these aspects. 

Humans can hear frequencies roughly between 20 Hz and 20 kHz, with fre- 
quencies at the lower end being perceived as vibrations or creaks, and with the 
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upper end decreasing with age and duration of noise exposure [52]. From the 
psychoacoustic perspective, frequencies down to 90 Hz contribute to sound 
lateralisation, i.e., localisation on the interaural axis within the head [53], and up to 
16 kHz to sound localisation, i.e., localisation outside the head [54], defining the 
smallest frequency range for the HRTF acquisition. Figure 2 shows the amplitude 
spectra of a binaural HRTF pair of two listeners. For each listener, the left and right 
columns show HRTFs of the left and right ear, respectively. The top row shows the 
HRTFs along the median, i.e., for the lateral angle of zero, from the front, via up, to 
the back. The bottom row shows the HRTFs along the Frankfurt plane, i.e., the 
horizontal plane located at the eye level. Figure 2 demonstrates that HRTFs vary 
across ears, frequency, sound-source positions and listeners. The bottom panels 
emphasise the difference between ipsilateral and contralateral ear, showing the 
dynamic range, especially for frequencies higher than 6 kHz. 

Assuming the propagation medium is air and a sonic speed of 340 m/s, the 
human hearing frequency range translates to wavelengths approximately between 
1.7 cm and 17 m, resulting in different body parts affecting HRTFs in different 
frequency regions. The reflections of the torso create spatial-frequency modulations 
in the range of up to 3 kHz [1]. This effect can be observed in the top row of 
Figure 2, in the form of elevation-dependent spectral modulations along the 
median plane [55, 56]. Another contribution comes from the head, which shadows 
frequencies above 1 kHz. This effect can be observed in both rows of Figure 2, with 
large changes in the spectra beginning at around 1 kHz [57]. A large contribution is 
that of the pinna: The resonances and reflections within the pinna geometry create 
spectral peaks and notches, respectively, in frequencies above 4 kHz [54]. This 
effect can be observed in the bottom row of Figure 2. 

From the perceptual perspective, the quality of these HRTF spectral profiles is 
important in many processes involved in spatial hearing. For example, sound- 
localisation performance deteriorates when these spectral profiles are disturbed by 
means of introducing spectral ripples [58], reducing the number of frequency 
channels [59] or spectral smoothing [60]. From the acoustic perspective, these 
spectral profiles show modulation depths of up to 50 dB [11], defining the required 
dynamic range in the process of HRTF acquisition. 

The temporal aspects of HRTF acquisition are shown in Figure 3 as the head- 
related impulse responses (HRIRs), i.e., HRTFs in the time domain, of the same 
listeners as in Figure 2. There are a few things to consider. First, the minimum 
length of the measurement is bounded by the length of the HRIRs. Their amplitude 
decays within the first 5 ms, setting the requirement for the room impulse response 
during the measurements [61]. After the 5 ms, the HRIRs decay below 50 dB, setting 
the requirement on the broadband signal-to-noise ratio (SNR) of the measurements. 
Further, because of the human sensitivity to interaural disparities, HRTF acquisition 
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HRTF log-magnitudes in time domain along the eye-level horizontal plane for the same listeners as in Figure 2. 
Note the decay within the first 5 ms. 
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also requires an interaural temporal synchronisation. While sound sources placed in 
the median plane cause an ITD of zero (theoretically, reached only for identical path 
lengths to the two ears), just small deviations from the median plane cause poten- 
tially perceivable non-zero ITDs. Human listeners can detect ITDs being as small as 
10 us [53, 62], defining the interaural temporal precision required in the HRTF 
acquisition process. The ITD increases with the lateral angle of the sound source, 
reaching its extreme values for sources placed near the interaural axis [63, 64]. The 
largest ITD depends on the distance between the listener’s two ears, mostly being 
defined by the listener’s head width and depth [65], reaching ITDs of up to +800 ps. 
That ITD range translates to the sound’s time of arrival (TOA) at an ear varying in 
the range of 1.6 ms, which needs to be considered in HRTF measurement by pro- 
viding sufficient temporal space in the resulting impulse response. 

HRTFs are continuous functions in space, even though, they are traditionally 
acquired for a finite set of spatial positions. From the acoustic perspective, assuming 
an HRTF bandwidth of 20 kHz, at least 2209 spatial directions are required to 
capture all spectro-spatial HRTF variations [66]. While this quite large number of 
spatial directions increases even further when considering multiple sound distances, 
it is in discrepancy with a smaller number of directions usually used in HRTF 
acquisition [11, 67-69]. One reason is the much smaller perceptual spatial resolution. 
From that perspective, the spatial resolution is limited by the ability to evaluate 
ITDs and changes in HRTF spectral profiles, both of which converge in the so-called 
minimum audible angles (MAAs). The MAA indicates the smallest detectable angle 
between two sound sources [70]. It depends on signal type [71, 72] and is minimal 
for broadband sounds [54, 73-75]. The MAA further depends on the direction of the 
source movement. Along the horizontal plane, the MAA can be as small as 1° for 
frontal sounds [76], increasing up to 10° for lateral sounds [77-79]. This translates 
to a high spatial-resolution requirement for frontal directions that can be relaxed 
with increasing lateral angle. Along the vertical planes, the MAA can be as low as 4° 
for frontal and rear sounds [76], increasing up to 20° for other sound directions 
[80]. Note that further relaxation of the requirement for spatial resolution can be 
achieved by using interpolation algorithms in the sound reproduction. For example, 
when using amplitude panning between the vertical directions [81], a resolution 
better than 30° does not seem to provide further advantages for localisation of 
sounds in the median plane [82]. Finally, when it comes to dynamic listening 
situations (involving listener or source movements), the MAAs further increase 
[83]. In order to account for sufficient spatial resolution when applying HRTFs in 
dynamic listening scenarios, the movement of the listener has to be monitored 
additionally to the modelling of sound source movement [84-86]. The minimum 
amount of directions and specific measurement points for a sufficiently sparse 
HRTF set are still current topics of research [87]. 

HRTFs are listener-specific, i.e., they vary among the listeners [21]. The reasons 
for that inter-individual variation are usually rooted in listener-specific morphology 
of the head and ears. For example, the variation in the head width of approximately 
+2 cm across the population causes variation in the largest ITD in the range of +80 
us [88]. Figure 4 shows HRTF-relevant parts of the human body, where Figure 4a 
shows rough measures of the body and Figure 4b shows areas of the pinna respon- 
sible for the distinct spectral features in higher frequencies. The width and depth of 
head and torso have a large effect on HRTFs in the lower frequencies. The inter- 
individual variation in the pinnae geometry causes variations in HRTFs in frequen- 
cies above 4 kHz, with listener-specific differences of up to 20 dB [11]. The inter- 
individual variation in the HRTFs is rather complex because the pinna is a complex 
biological structure—small variations in geometry (in the range of millimetres) may 
cause drastic changes in HRTFs [90] along the vertical planes in high frequencies 


Perspective Chapter: Modern Acquisition of Personalised Head-Related Transfer Functions... 
DOI: http://dx.doi.org/10.5772/intechopen.102908 


Crus superius anthelicis 


SSIS sta\a= 
aaa 


Cymba conchae 


A i | Crus LN 
PAL AITINEN SNY | i i 
VAAN MRSS © Insicura anterior 
LL a; r 

ALA 


Torso 


Figure 4. 

HRTF-relevant parts of the human body. (a): Head and torso represented with simple shapes based on [57]. 
The black arrows denote the relevant measures. (b): Pinna and its distinctive regions. In red, green, and blue the 
concha, fossa triangularis, and scapha, respectively, denote the acoustically relevant areas [48, 56, 89]. 


[11], see Figure 2. However, not all pinna regions affect HRTFs equally [91]. 
Basically, the convex curvatures of the pinnae contribute to focusing the incoming 
sound waves towards the entry of the ear canals, comparable to a satellite dish. 
Figure 4b shows the anatomical areas important for localisation of sounds [48, 56, 
89, 92, 93]. Currently, the description of the pinna geometry is not a trivial task. 
Pinnae have been described by means of anthropometric data stored in various data 
collections, e.g., [67, 69, 88, 94-96]. While the parameters used in these data 
collections do not seem to completely describe a pinna geometry from scratch, 
recent efforts aim at parametric pinna models able to generate non-pathological 
pinna geometries for arbitrary listeners [47, 48]. Such models describe the pinna 
geometry by means of various control points placed on the surface of a template 
pinna geometry. Figure 5 shows two examples of the implementation of such 
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Figure 5. 

Examples of parametric pinna models. (a): Model from [47] consisting of Beziér curves (depicted in green), 
their control points (black spheres at both ends of a curve) and weights (not shown), linked to a template pinna 
geometry, (b): Model from [48], defined by control points of the pinna relief (green points) linked to proximal 
mesh vertices. 
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models. In Figure 5a, the pinna geometry is parametrised with the help of Beziér 
curves, i.e., polynomials within a spatial boundary [47]. Figure 5b shows a different 
approach; here, the parameterisation of the pinna is utilised with control points that 
move proximal local areas [48]. These parametric pinna models represent a step 
towards understanding the link between HRTFs and specific anatomical regions of 
the pinnae, and provide potential to synthesise large datasets of pinnae, e.g., in 
order to provide data for machine-learning algorithms. 

In addition to the geometry, skin and hair may have an impact on HRTFs 
[97, 98] because of their direction-dependent absorption of the acoustic energy, 
especially at high frequencies. However, recent studies have shown that hair does 
not influence the localisation performance, but rather the perception of timbre 
instead [95, 99-101]. 


3. Acoustic measurement 


The principle of an acoustic HRTF measurement relies on the system identifica- 
tion of the HRTF considered as a linear and time-invariant system. Here, an HRTF 
describes the propagation path between a microphone and a loudspeaker. Because 
of the binaural synchronisation, HRTFs are measured simultaneously at the two 
ears. The measurements are commonly performed for many source positions 
because of the required high spatial resolution. Recently, the details of the acoustic 
measurements, including a comprehensive list of HRTF measurement sites has been 
reviewed [20]. Thus, we only briefly introduce the basics and focus on the most 
recent advances in the acoustic HRTF measurement. 

Typically, two omnidirectional microphones are placed in both ear canals, and 
the loudspeakers are arranged around the listener, ideally, with the number of 
loudspeakers corresponding to the number of HRTF positions to be measured. 
Figure 6 shows two examples of measurement setups of various complexity: In 
Figure 6a, the listener is located on a turntable and moves within a fixed near- 
complete circular loudspeaker array. Figure 6b shows a similar approach with a 
near-complete spherical loudspeaker array, and Figure 6c shows the placement of a 
microphone in the ear canal so that it is membrane lines up with the entrance of the 


Figure 6. 

(a) Example of an HRTF measurement setup with mechanical rotation required. Listener sits on a chair 
(mounted on a turntable) surrounded by a loudspeaker arc (22 loudspeakers ranging from —30 to 210° in 5°- 
steps). A head-tracker mounted on the head of the listener tracks head movements, triggering the need for 
measurement repetition in case of too large movements. (b) Example of more recent HRTF setups. 91 
loudspeakers ave mounted in a near-complete spherical array reducing the total measurement duration. (c) 
Example for a microphone placement in an HRTF measurement. Note the closed ear canal and the head- 
tracker sensor. 
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ear canal. Actually, it does not matter whether the microphones or loudspeakers are 
placed in the ear canal—this approach of ‘reciprocity’ is usually facilitated in 
numeric HRTF calculations (Section 4.4). However, setups with loudspeakers in the 
ears [102] lack signal-to-noise ratio (SNR) as the amplitude of the source signal 
needs to be low enough to not harm the listener, making the setup impractical for 
experiments. With the microphones in the ears, the most simple setups consist of a 
single loudspeaker moved around the listener [103]. Unfortunately, such setups 
lead to a long measurement duration for a dense set of HRTF positions. With the 
increasing availability of multichannel sound interfaces and adequate electroacous- 
tic equipment, over the decades, the number of actually used loudspeakers 
increased. Setups with only a single loudspeaker moving around the listener have 
been replaced by setups with loudspeaker arcs surrounding the listener. In those 
setups, the listener sits on a turntable and either the listener (e.g., Figure 6a) or the 
loudspeaker arc is rotated [88, 104]. 

Recent approaches follow one of two different directions; On the one hand, 
generic and individual HRTFs are measured with a growing number of loud- 
speakers used in specialised facilities [67, 95]. Some even with such a large amount 
of loudspeakers that the listener is rotated for a few discrete positions, and post- 
processing algorithms interpolate between HRTF directions, e.g., the setup in 
Figure 6b. On the other hand, user-friendly individual HRTF measurement 
approaches are suggested, showing a trend towards decreasing the complexity of 
the measurement setup and using widely available equipment. In these approaches, 
only a single speaker is used and the listener is asked to move the head until a dense 
setup of HRTF directions can be obtained. These measurements enable simple 
systems to be used at home [105, 106], in which a head-tracking system records the 
listener’s head movements in real time and adapts the measured spatial HRTF grid. 
Head-above-torso orientations have to be considered additionally [100], but they 
reduce the complexity of the measurement setup and enable using widely available 
equipment, e.g., a commercially available VR headset and one arbitrary loud- 
speaker, in regular rooms, thus increasing the user-friendliness for setups [105]. 

Most of those recent approaches consider spatially discrete positions of the 
listener and/or the loudspeakers. In order to tackle the trade-off between high 
spatial resolution and long measurement duration, other recent advances have been 
made towards spatially continuous measurement approaches [107-109]. These 
approaches enable the measuring of all directions around the listener for a single 
elevation within less than 4 minutes [110]. Certainly, an advantage of such an 
approach is the access to the spatially continuous information, which is important 
especially for frontal HRTF directions. With more and more silent turntables 
and swivelled chairs, achieving a high SNR is not a big issue. Most recent 
approaches related to the spatially continuous measurement utilise Kalman filters 
to acquire system parameters representing HRTFs, and thus speed up the HRTF 
measurement in a multi-channel setup [111]. Compared to spatially discrete 
approaches, the spatially continuous method can achieve accuracy within a spectral 
error of 2 dB [109]. 

The requirements of the room are not rigorous: In principle, the measurement 
room does not have to be perfectly anechoic, but it has to fulfil some requirements 
regarding size and reverberation time. Room modes may exist below 500 Hz as they 
can be neglected in that frequency range [1]. Acceptable measurement results can 
be obtained as long as the first room reflection arises after 5 ms such that the 
measured room impulse responses can be truncated without truncating the HRIRs. 
Medium and large surfaces, i.e., the mount of the loudspeakers, the loudspeaker arc, 
the turntable, listener seat, etc., can potentially cause acoustical reflections 
overlapping with the direct sound path within the first 5 ms of the HRTF. These 
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reflections are usually damped, e.g., by covering the speakers in absorption mate- 
rial. Before the measurement, the listener’s head has to be aligned in the measure- 
ment setup, adjusting the ears to the interaural axis of the system and the head to 
the Frankfurt plane. This alignment can be supported by, e.g., a laser system. The 
orientation and position of the listener's head should be monitored throughout the 
measurement procedure in order to detect listeners unwanted movements or posi- 
tion drifts. This helps when having to repeat potentially corrupted measurements. 

The loudspeakers used for the measurements need to show a fast impulse 
response decay; fast enough to not interfere with the temporal characteristics of the 
HRTFs. This can be achieved by using loudspeaker drivers with light membranes, 
simple electric processing and no acoustic feedback such as a bass-reflex system. 
The acoustic short-circuit usually limits the lower frequency range of the loud- 
speakers, and multidriver systems are a common solution to that problem. In order 
to achieve a spatially compact acoustic source in a multidriver system, it is common 
to use coaxial loudspeaker drivers with an omnidirectional directivity pattern in 
HRTF measurement systems [112]. 

The placement of the microphones can also be an issue. Early setups used an 
open ear canal where the microphones were positioned close to the eardrum [11]. 
However, the effect of the ear canal does not seem to be direction-dependent, and 
its consideration in the measurement introduces technical difficulties and a large 
measurement variance [19, 113, 114]. Nowadays, the microphones are usually 
placed at the entrance of the ear canal which is acoustically blocked [11, 20]. 
Blocking the ear canal can be achieved by using microphones enclosed in earplugs 
made from foam or silicone or by wrapping the microphone in skin-friendly tape 
before inserting it. Note that such a measurement captures all directional- 
dependent features of the acoustic filtering by the outer ear, however, the 
directional-independent filtering by the ear canal is not captured. All cables from 
the microphone have to be flexible enough to minimise their effect on the acoustics 
within the pinna—one way is to lead the cable through the incisura intertragica and 
secure it with tape on the cheek, see Figure 6c. 

In general, system identification can be performed with a variety of excitation 
signals. While previously Golay codes or other broadband signals have been used 
[115], more recently, the multiple exponential sweep method (MESM) [112] has 
been established and further improved [116], enabling fast HRTF measurement at 
high SNRs, reducing the discomfort for the listener. Still because of the imperfec- 
tions in the electro-acoustic setup, a reference measurement is required to estimate 
the basis of the measurement without the effect of the listener, i.e., to estimate pọ. It 
is typically done for each microphone by placing the microphone in the centre of 
the measurement setup and recording the loudspeaker-microphone impulse 
response for all loudspeakers. The reference measurement can also be used to 
control the sound pressure level in order to avoid clipping at the microphones and 
analogue-digital converters. This can especially happen when each loudspeaker is 
driven within its linear range, but the overlapping signals from multiple loud- 
speakers raise the total level to ranges beyond the linear range of the recording 
system. Additionally, because of the HRTF’s resonances, the level during the actual 
HRTF measurements can be up to 20 dB higher than that during the reference 
measurement, translating to a requirement for the headroom of at least 30 dB at the 
reference measurement. The maximum level is not only limited by the equipment; 
the listener’s hearing range also needs to be considered, i.e., the maximum sound 
pressure level must neither create discomfort for the listener, nor go beyond the 
levels of safe listening. There is no special requirement for the listener regarding 
their audible threshold, hearing range or the visual sense. However, a particular 
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measurement equipment or a particular lab could have some restrictions on, e.g., 
the listener’s weight or height. 

Figure 7 shows measurement grids of three exemplary setups and one measure- 
ment grid of a simulation setup. Figure 7a and b correspond to the measurement 
setups in Figure 6a and b. In these setups, not every loudspeaker plays a stimulus at 
every position around the listener. An extreme case is a loudspeaker positioned at 
90° elevation that needs to be only measured once. Figure 7c shows another setup 
with uniformly distributed measurement points, and Figure 7d shows a uniform 
sampling grid used in numerical calculation experiments [95]. 

The repeatability of the measurement is an important issue. Within a single 
laboratory, changes in the room conditions such as temperature and humidity, as 
well as changes in the setup such as the ageing of the equipment may compromise 
the repeatability of the HRTF measurement [11, 20]. When comparing the HRTFs 
measurement across the labs, differences in the setups play also a role. In inter- 
laboratory and inter-method HRTF measurement comparison obtained for the same 
artificial head, severe ITD variations of up to 200 yw have been found [63, 64]. 

Once the HRTFs have been measured for all source positions, post-processing 
needs to be done before the HRTFs are ready to be used. First, in order to account 
for acoustic artefacts caused by the measurement room, a frequency-dependent 
windowing function is usually applied truncating the HRIRs [100, 117, 118]. Sec- 
ond, the measured HRIRs are equalised by the impulse response obtained from the 
reference measurements, i.e., with the microphone placed at the centre of the 
coordinate system with the listener absent. This equalisation can be either free-field 
or diffuse-field. For the free-field equalisation, the reference measurement is 
required only for the frontal direction (0° azimuth, 0° elevation) [54], whereas for 
the diffuse-field equalisation, the reference measurement is the root mean square 
(RMS) impulse response of all directions [75], and the results are commonly 
denoted as directional transfer functions (DFT) [119]. Third, in most common 
rooms and even in (semi)anechoic rooms, reflections (or room modes) cause arte- 
facts below 400 Hz, confounding the free-field property of HRTFs. Additionally, 
most loudspeakers used in the measurement are not able to reproduce low frequen- 
cies with sufficient power. Since the listener’s anthropometry has a small effect on 
HRTFs in the low-frequency range, HRTFs can be extrapolated towards lower 
frequencies with a constant magnitude and linear phase [20, 117]. Further post- 
processing steps may include spectral smoothing to account for listener position 


(c) (d) 


Figure 7. 

Four examples of spatial HRTF grid resolutions. (a) Almost spherical loudspeaker arc with moving listener, see 
also Figure 6a; the loudspeaker arc consists of 22 loudspeakers and yields 1550 directional measurements. 
Notice the higher resolution in the front of the listener as opposed to the lower resolution in the lateral regions. 
(b) Sparse measurement grid in a nearly spherical 91-loudspeaker array, see also Figure 6b, yielding 451 
directional measurements with 5 different listener positions. (c) Measurement grid with 440 uniformly 
distributed points, measured with a loudspeaker arc with 37 loudspeakers from the HUTUBS database [95]. 
(d) Sampling grid for numerical calculations with 1730 directions. (c) and (d) are taken from the HUTUBS 
database [95]. 
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inaccuracies [60, 120] or adding a fractional delay to account for temperature 
changes followed by onset changes of the time signals [100]. 

The availability of acoustical HRTF measurements was a big step towards 
personalised binaural audio and virtual reality experience. However, even a fast or 
continuous measurement method requires the listener to sit still for a few minutes 
[104, 110, 112] in a specialised lab facility. Recent advances have been made 
towards both large-scale high-resolution and small-scale at-home easy-to-use solu- 
tions, providing HRTF acquisition to a large audience. Still, the imperfections in the 
electro-acoustic equipment set drawbacks of the acoustic measurement. Here, 
recent advances in the numeric calculations of the HRTFs can provide an interesting 
alternative. 


4. Numerical calculation of HRTFs 


Generally, the calculation of HRTFs simulates the effects of the pinna, head and 
torso on the sound field at the eardrum. The goal is to numerically obtain the sound 
pressure at the two ears for a given set of frequencies and spatial positions. There 
are many methods to simulate wave propagation [121]. When applied to the HRTF 
calculation, all of the methods require a geometric representation of head and 
pinnae as input. For an accurate set of HRTFs, an exact 3D representation of the 
geometry, especially that of the pinnae with all their crests and folds, is of utmost 
importance [90]. The 3D geometry is represented using a discrete and finite set of 
elements, further denoted as ‘mesh’. A mesh is a representation of the region of 
interest (ROI), i.e., the object’s volume and surface, with the help of simple geo- 
metric elements. In most applications, the faces of these elements are assumed to be 
flat, which in turn explains the preference for triangular faces because they are 
always flat and therefore have one unique normal vector. This is not always the case 
for other shapes, e.g., quadrilaterals. 

The requirements on the mesh have to consider geometrical as well as acoustical 
aspects. From the acoustic perspective, a typical rule of thumb for numerical calcu- 
lation requires the average edge length (AEL) of elements to be at least a sixth of the 
smallest wavelength [122], which corresponds to an AEL of 3.5 mm for frequencies 
up to 16 kHz. However, in order to describe the pinna geometry sufficiently accu- 
rate, the average edge length (AEL) of the elements in the mesh needs to be around 
1 mm, independently of the calculation method [90]. Some numerical calculation 
algorithms are, in general, more efficient and stable if the geometries are 
represented locally with elements of similar sizes and as regular as possible, e.g., 
almost equilateral triangles. To this end, the mesh may undergo a so-called 
remeshing [123], which inserts additional elements and resizes all elements to a 
similar size. Figure 8 shows the same pinna in all panels, represented by meshes 
with increasing AELs from left to right. 


Figure 8. 
Pinna meshes represented by various AELs [90]. From left to right: AEL of 1 mm, 2 mm, 3 mm, 4 mm and 
5 mm. Note how the representations of the helix and fossa triangularis degrade with increasing AEL. 
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Interestingly, only the pinna regions contributing to the HRTF (compare 
Figure 4b) require to be accurately represented [56] and the remainder of the 
geometry can be more roughly modelled. This applies especially to the head, torso 
and neck, which can be represented by larger elements. These anatomical parts can 
additionally be approximated by simple geometric shapes, e.g., a sphere for the 
head, a cylinder for the neck and a rectangular cuboid or an ellipsoid representing 
the torso [65], see e.g., Figure 4a. To emphasise the sophisticated direction depen- 
dency of the pinna, Figure 9 shows the calculated sound pressure distribution over 
the surface of the pinna. This simulation is calculated by defining one element in the 
centre of the ear canal as a sound source and evaluating the resulting sound pressure 
field at the vertices of the rest of the geometry; the procedure is explained thor- 
oughly in Section 4.4. 

The geometry can be captured via numerous approaches [124]: a laser scan 
[125], medical imaging techniques such as magnetic resonance imaging (MRI) 

[69, 126] and computer tomography (CT) [90], or photogrammetric reconstruction 
[127]. Laser, MRI and CT scans yield high-resolution meshes offering a small geo- 
metric error, but in turn, they need a special equipment. The laser scans are based 
on line-of-sight propagation and are able to measure short distances with an accu- 
racy of up to 0.01 mm. The downside of line-of-sight propagation is that the mani- 
folds of the pinnae are not easy to capture. In the medical imaging approaches, 
different downsides arise; acquiring the pinnae geometry via MRI is not a trivial 
process because they are flattened by the head support. This leads to two separate 
MRI measurements of each ear. The anatomy is then captured in ‘slices’ that can be 
stitched together in the postprocessing rather easily. The CT captures the anatomy 
in a similar way, but due to the high radiation exposure, such scans are usually not 
done with human subjects but with (silicone) mouldings of the listener’s ear. The 
overall procedure may take more time than an acoustic HRTF measurement and 
require the listener to either manufacture a moulding or meeting rather specific 
criteria for the scanning equipment (e.g., no tattoos, piercings, or implants). As an 
alternative, recent advances have been made for more widely applicable approaches 
such as photogrammetry [23, 128]. Photogrammetry is not only non-invasive but 
also can be done with widely available equipment, e.g., a smartphone or digital 


sound 

pressure 

maximum 

sound 

pressure 

minimum 
Figure 9. 


Magnitude of the sound pressure calculated for each element of the surface for a 13-kHz sound source placed in 
the ear canal. Note the high dynamic range containing peaks (red) and notches (blue) in the distribution 
pattern in the area of the pinna. 
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camera, without having the listener to travel to a specialised facility. In a nutshell, 
the photogrammetrical approach works as follows: a set of photographs from dif- 
ferent directions is made for each ear [127, 129], the so-called structure from motion 
[130] approach estimates the camera positions by analysing the mutual features 
across the photographs; a 3D point cloud is constructed; and a 3D mesh is created by 
connecting the points in the cloud. Note, that currently, manual corrections (e.g., 
smoothing to reduce noise, filling holes) are still required to reach the high quality 
of the meshes required for accurate HRTF calculations. 

Simulations of acoustics require the information about the acoustic properties of 
the simulated objects. The HRTFs can be simulated with the 3D geometry 
represented as fully reflective, i.e., all surfaces having infinite acoustic impedance. 
With respect to localisation performance, only a small perceptual difference was 
found between acoustically measured and HRTFs calculated for acoustically reflec- 
tive surfaces [101]. However, the impedance of various regions such as skin and 
hair may influence the direction-independent HRTF properties and cause changes 
in the perceived timbre [95, 99, 100]. 

In order to calculate HRTFs with sufficient spectral accuracy, the number of 
elements needs to be in the range of several tens of thousands, which might be 
important for the requirements of the computational power. Such large numerical 
problems usually require large amount of memory being in the range of Gigabytes. 
Nevertheless, the calculation time may reach a few days, especially when calculat- 
ing HRTFs for many frequencies with high-resolution meshes. Note that if the used 
algorithm calculates HRTFs for each frequency independently, the calculations can 
be performed in parallel, and computer clusters can be used. This reduces the 
calculation time to a few hours for HRTFs the full hearing range and a mesh of 
several tens of thousands of elements. 

All the algorithms for numerical HRTF calculation are based on the propagation 
of sound waves in the free field around a scattering object (also “scatterer”), usually 
described by the Helmholtz equation 

Vp(x) + k*p(x) = q(x), x E Q, (2) 
where V? = a F & 
(complex valued) sound pressure at a point x for a given wavenumber k in the 
domain Q, around the scattering object and q(x) denotes the (complex-valued) 
contribution of an external sound source at the acoustic field around the object. The 
wavenumber k is er with f being the frequency and c the speed of sound. 

In order to solve the Helmholtz equation for a given scatterer, boundary condi- 
tions are necessary. The Neumann boundary condition assumes the object to be 
acoustically hard, and the (scaled) particle velocity at the boundary can be set to zero, 


+ 2, denotes the Laplace operator in 3D, p(x) denotes the 


where n denotes the normal vector at the surface pointing away from the object. 
For the boundary condition at infinite distance, the so-called Sommerfeld radiation 
condition can be applied, 


Gar) a ca (=), Ixl] > =, 


with o(.) showing that the right side grows much faster than the left side. This 
ensures that the sound field decays away from the object [131]. 
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For the calculation of HRTFs, the Helmholtz equation can be solved numerically 
by means of various approaches, which are based on a discretisation of the exterior 
domain ©, around the scatter I. Some of these methods solve the Helmholtz 
equation in the frequency domain, and others solve its counterpart, the wave 
equation, in the time domain. In general, the formulations and the results in the 
different domains can be transformed into each other by using, e.g., the Fourier 
transformation. In the following, we describe the most prominent methods used for 
HRTF calculations. 


4.1 The finite-element method 


The finite-element method (FEM) solves the Helmholtz equation, Eq. (2), con- 
sidering the scattering object or the domain around it as a volume [132]. Figure 10 
shows an example of a finite (domain) volume Q, considered in the calculations 
with the scatterer with surface I placed inside that volume. To simulate the acoustic 
field around that object, the weak form of the Helmholtz equation is used, i.e., the 
equation is multiplied by a set of known test functions w(x) and integrated over the 
whole domain, thus transforming the partial differential equation (e.g., the Helm- 
holtz equation) into an integral equation, that can be easier solved numerically: 


| (V*p(x) + k*p(x))w(x)dx = [aww (xax. (3) 
Ò ò, 
Secondly, the unknown pressure p(x) is approximated by a linear combination 


N 


p(x) => poa) (4) 
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Figure 10. 
2D representation of meshes used in FEM. The elements are uniformly distributed and fitted to the boundary of 
the domain Q,. 
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of so-called ansatz functions @ jx) j =1, ...,N. These ansatz functions, or ele- 
ment shape functions, help at interpolating between the discrete solutions for each 
point of the mesh. They are, in general, simple (real) polynomials defined locally on 
the elements of the mesh, e.g., having the value of 1 at their own coordinates and 
zero for other points of the mesh. Recent advances have been made towards adap- 
tively finding higher-order polynomials and thus drastically reducing the computa- 
tional effort [133, 134]. In theory, Eq. (3) should be fulfilled for all possible test 
functions w(x), in practice, however, often the ansatz functions are also used as test 
functions, i.e., w;(x) = ġ;(x). With this choice, Eq. (3) can be transformed into a 
linear system of equations 


Kp = g, (5) 


where 


dx dx 


dd, 49; 
Kj = | $: a E k bio ax, gi = [aids 
Qe Q 


and p is the vector containing the unknown coefficients p; of the representation 
Eq. (4). 

In general, the unknown coefficients p, represent the complex sound pressure at 
a given node x; of the mesh. The integrals involved are solved using numerical 
methods [135]. 

When calculating HRTFs, the space around the scatterer is assumed to be contin- 
uous and infinite; in practice, this space has to be discretised and truncated to a finite 
domain by inserting a virtual boundary. When applied to the calculation of HRTFs, a 
virtual boundary of the (now finite) domain Q, needs to be defined and conditions 
have to be set to avoid any reflections from this boundary, thus keeping in line with 
anechoic or free-field conditions. There are several methods to do so, with the so- 
called perfectly matched layers (PMLs) being the most popular among the reviewed 
HRTF calculation approaches. The PML is created by inserting an artificial boundary 
inside Q,, e.g., a sphere with sufficiently large radius, and artificially damped equa- 
tions are used to represent a solution that can then be numerically calculated, fulfill- 
ing the Sommerfeld radiation condition. Recent advances have been made to define 
PMLs automatically by extruding the boundary layer of the mesh and obtaining the 
geometric parameters during the extrusion step [136]. 

The FEM has been widely used in HRTF calculations [137-141] and yields 
similar results to acoustical HRTF measurements with spectral magnitude errors of 
approximately 1 dB [137, 141]. The downside, however, is the need to model 3D 
volumes around the head, resulting in models of a high number of elements, having 
a strong impact on the calculation duration. 


4.2 The finite-difference time-domain method 


A similar approach as the FEM can also be followed in the time domain. By using 
a short sound burst in the time domain as an input signal, the HRTFs within a wide 
frequency range can be calculated at once. This approach is called the finite- 
difference time-domain (FDTD) method [142] and can be derived by solving the 
wave equation in the time domain 


a Pp, t) _ 
Vp(x,t) — E 2 = q(x,¢), 
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where p and q denote sound pressure fields in the time domain. The PML is 
applied to create the boundary conditions for outgoing sound waves. The evaluation 
grid is sampled evenly in cells across the domain with grid spacing h, considering 
discrete time steps m. A key parameter for numerical stability of the FDTD is the 
Courant number 


defining the number of cells the sound propagates per time step. Typically, in 
order to obtain stable HRTF calculations, the Courant number is p = 1/ V3. 

Figure 11 shows a 2D representation of a mesh used in the FDTD method. Note 
that because the mesh needs to consist of evenly spaced elements, most of the 
objects cannot be represented accurately and a sampling error is introduced at the 
boundary surface I of the object. Additionally, as derivatives of functions are 
approximated by finite differences, the arithmetic operations are valid for infinite 
resolution, but when calculating on physical computers, the precision depends on 
the number format used and the gridsize, introducing errors in the results [143]. 
Refining the grid is a potential solution to the sampling error for staircase approxi- 
mations [144, 145], and when framing this problem to HRTF calculations, spectral 
magnitude errors of 1 dB up to 8 kHz and 2 dB up to 18 kHz can be achieved 
[146, 147], suggesting only negligible increase in localisation errors when listening 
to HRTFs calculated by the FDTD method. 

Because of the additional sampling errors for irregular domains, recent advances 
have been made towards using quasi-cartesian grids [148], dynamically choosing 
grid resolutions [149], or towards the finite-volume method (FVTD), which is 
based on energy conservation and dissipation of the system as a whole and uses the 
integral formulation of the FDTD [150]. One solution approach there is to adap- 
tively sample the grid at the boundary and introduce unstructured or fitted cells 
[151, 152]. A thorough comparison between FEM, FDTD and FVTD methods is 
available in [153]. 

In fact, the FDTD method has been widely applied to HRTF calculations 
[145, 146, 154, 155], and it certainly offers the advantage of calculating broadband 


Figure 11. 
2D representation of meshes used in the FDTD method. Note that in this representation, the object surface T does 
not line up exactly with the sampling grid. 
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HRTFs while not introducing additional computational cost when multiple inputs 
or outputs are used. However, because of the complex geometry of the pinnae, a 
submillimetre sampling grid is required, resulting in the need for a delicate 
preprocessing. 


4.3 The boundary-element method 


The boundary element method (BEM) is based on a special set of test functions 
in the weak formulation of the Helmholtz equation Eq. (3), namely the Green’s 
function 


eikr 


n (x, y) — Anr? 

where r = ||x — y||, and x and y are two points in space. By using this function, it 
is possible to reduce the weak form of the Helmholtz equation to an integral 
equation, i.e., the boundary integral equation (BIE), that only involves integrals 
over the surface T of the object, and not the volume ,. Assuming that the external 
sound source as a point source at x*, and an acoustically reflecting (= sound hard, 
Vp(y) -n = 0,y €T) surface, the sound field at a point x on a smooth part of that 
surface I is given by: 


1 x 
5 P(x) — |x (x y)p(y)dy = G(x", x)po. (6) 
T 
where H (x, y) = a (x, y) is obtained by the derivative of the Green’s function at 


a point y in the direction of vector n normal to the surface at this position, and po 
denotes the strength of the sound source. 

In comparison with the other two methods, the BEM has the advantage that only 
the surface of the object such as the head and the pinnae needs to be discretised, 
whereas in FEM and the FDTD method also a discretisation of the volume sur- 
rounding the head has to be considered, see Figures 10-12. Thus, in the boundary 
element method, all calculations can be reduced to a manifold described in 2D, in 
our case, the domain of interest is reduced to the surface of the head. A second 
advantage of the BEM is that by using the Green’s function, the Sommerfeld radia- 
tion condition is automatically fulfilled. Additionally, no domain boundary has to be 


Q 


e 


Figure 12. 
2D representation of a BEM mesh. Note that only the boundary of the surface T is considered and the domain 
volume Q, does not have to be sampled. 
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introduced, such as the PML. This renders the BEM an attractive method for 
calculating sound propagation in infinite domains, i.e., in free-field, as is the 
assumption when calculating HRTFs [156]. 

In order to solve a BEM problem, the BIE is discretized and solved by using 
methods such as the Galerkin, collocation or Nyström [157-159], all with the com- 
mon goal of yielding a linear system of equations. 

For the Galerkin method, the unknown pressure is approximated by a linear 
combination of ansatz functions as in Eq. (4). The BIE is again multiplied with a set 
of test functions (similar to the test functions ¢;(x) used in FEM) and integrated 
with respect to x yielding a linear system of equations as in Eq. (5), where 


Kij = [eee as — [focn (x,y) pj (y)dydx, 


p 


and 


8; = [ow — x)0;(x)dx. 


Another commonly used approach especially used in engineering is collocation 
with constant elements, i.e., the sound field is assumed to be constant on each 
element of the mesh, and the BIE is solved at the midpoints x; of each element (the 
set of all x; are called collocation nodes) yielding a linear system of equations as in 
Eq. (5), where 


1 x 
Ki = 5 Si = | H(x;,y)dI, g; = G(x", Xi)Po: 


j 


p; = p(xi) and with x* being the position of the sound source outside the head. 
The integrals over each element are numerically solved using appropriate quadra- 
ture formulas (weighted sum of function values) and 


1 for i=j 
a= | ee 
0 for iF¥j 


The BIE is solved for a given set of frequencies and the solutions p, at the 
collocation nodes are then used to derive the HRTFs. Note that the collocation 
method can be interpreted as the Galerkin method utilising the delta functionals 
6(x; — x) as test functions. A thorough comparison between Galerkin and colloca- 
tion approaches can be found in [160]. 

The discretisation of just the surface introduces additional challenges. First, the 
Green’s function becomes singular at the boundary where ||x — y|| = 0 and special 
quadrature formulas need to be used close to these singularities [161, 162]; and 
second, the system matrix K, although small, is usually densely populated, which 
poses a challenge for computer memory and the efficiency of the linear solver used. 
When considering HRTF calculations for frequencies up to 20 kHz, high-resolution 
meshes are required and the corresponding linear systems may contain up to 
100,000 unknowns. 

In order to efficiently deal with such large systems, the BEM can be coupled with 
methods speeding up matrix—vector multiplications, such as the fast-multipole 
method (FMM) [163] or H-matrices [164] (so-called ‘hierarchical’ matrices). These 
methods have enabled an efficient and feasible calculation of HRTFs [165]. Ina 
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nutshell, these methods aim at providing a method for an efficient and fast matrix- 
vector multiplication and are based on two steps. First, the elements of the mesh are 
grouped into clusters of approximately the same size with cluster midpoints z;. 
Second, for two clusters C4 and C2, that are sufficiently apart from each other, a 
separable approximation of the Green’s function 


G(x, y) © Gi(x — z1)M(z1, 22)Go(y — z2) 


is found. This approximation has two advantages: the local expansions G1 and G2 
have to be made only once for each cluster, and the interaction between the ele- 
ments of different clusters is reduced to a single interaction of the cluster mid- 
points. The resulting linear system of equations is then solved using an iterative 
equation solver [166]. 

Although the Helmholtz equation for external problems has a unique solution at 
all frequencies, the BIE has uniqueness problems at certain critical frequencies 
[159, 167]. Thus, to avoid numerical problems, the BEM needs to be stabilised at 
these frequencies, e.g., by using the Burton-Miller method [167]. BEM has been 
widely used to calculate HRTFs [165, 168-171] analysing the process from various 
perspectives. When applied to an accurate and high-resolution representation of the 
pinna geometry, BEM can yield similar results to the acoustic HRTF measurements 
by means of sound localisation performance [101, 172]. 


4.4 Reciprocity 


In principle, in order to calculate an HRTF set, the Helmholtz equation needs to 
be solved for every source position x; separately, leading in up to several thousand 


right-hand sides in Eq. (5). Solving that many equations cannot be done quickly 
even with the help of iterative solvers. On the other hand, the HRTF calculation for 
the second ear is quite simple because the solution obtained from the solver is 
already available for every element of the surface, i.e., at the element representing 
the ear canal of the second ear. The approach of reciprocity can help to significantly 
speed up the calculations by swapping the role of the many source positions with 
the two elements representing the ear canals, requiring us to solve Eq. (5) only 
twice, i.e., once for each of the ears. 

Helmholtz’ reciprocity theorem states that switching source and receiver 
positions do not affect the observed sound pressure. When applied to HRTF 
calculations, virtual loudspeakers are placed in the entrance of the ear canal 
(replacing the virtual microphones) and the many simulated sound sources 
are represented by many virtual microphones (replacing the many virtual loud- 
speakers around the listener). By doing so, the computationally expensive part of 
the BEM, i.e., solving a linear system of equations to calculate the sound pressure 
at the surface, needs to be done only twice, namely once for each ear. Subsequently, 
the sound pressure at positions around the head can be calculated fairly easy and 
efficiently. 

In more detail, assume that a point source with strength py at the position x 


7 
j 
causes a mean sound pressure of p over a small domain with area Ao at the entrance 


of the ear canal. If the domain is sufficiently small, the mean sound pressure is an 
accurate representation of the actual sound pressure in this domain. By applying the 
reciprocity, we introduce a reciprocal sound source at the entrance of the ear canal 


which introduces a velocity vo and then calculate the sound pressure p (x) at the 


actual sound-source position around the listener. The pressures p a and p are 
linked by 
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__ pop(x;) 
p= Aovo ` 

The reciprocal sound source can be modelled by vibrating elements Tip, i-e., 
elements with an additional velocity boundary condition 


v(y) = 2 vos  YErvp 


where w = 2zf ,, and p is the density of air. Note that vo can be an arbitrary 
positive number because when calculating HRTFs [see for example Eq. (1)], the 
pressure is normalised by the reference pressure po, thus cancelling vo. With this 
additional boundary condition, first, BEM can be used to calculate the sound field at 
the surface T, and then, Green’s representation is applied to calculate the pressure at 


all positions of the actual sound sources x; , 


p(x) — Ja (x; : y)P (y)dy + iwp|G (x; f y)o (y)dy = 0. 


Tr r 


Note that this equation is calculated after a discretisation, and because p(y) at 
the surface is known from the BEM solution, the calculation of the sound pressure 


around the head p (x ;) is a simple matrix multiplication. 


Reciprocity, combined with FMM-coupled BEM has been applied to calculate 
HRTFs, enabling calculations for a large spatial HRTF set within a few hours even 
on a standard desktop computer [172]. 


5. Other issues related to HRTF acquisition 


Over decades, HRTFs have been collected and stored in databases. Such 
databases are important for educational aspects, training of neural network 
algorithms [34, 37] and further research [23, 25-28, 173]. While in the early 
HRTF research days, HRTFs have been stored by each lab in a different 
format, since 2015, the spatially oriented format for acoustics (SOFA) is available to 
store HRTFs in a flexible but well-described way facilitating an easy exchange 
between the labs and applications. SOFA is a standard of the Audio Engineering 
Society under the name AES69. SOFA provides a uniform description of spatially 
oriented acoustic data such as HRTFs, spatial room impulse responses, and 
directivities [15]. 

When it comes to anthropometric data, unfortunately, there is currently no 
common format to specify and exchange anthropometric data. This is partially 
because currently, it is not known, which data are important. Some laboratories use 
the CIPIC parameters [88], some have extended them [174], and others have cre- 
ated whole new sets of parameters [128, 175]. An overview of currently used 
anthropometric parameters can be found in [176]. The development of parametric 
pinna models may shed light on the relevance of parameters needed to be stored in 
the future. The listener’s geometry can also be stored in non-parametric represen- 
tations such as meshes and point clouds of listener’s ears and head. To this end, 
typical 3D dataset formats are used, e.g., OBJ, PLY or STL. These formats are widely 
used in computer graphics and thus easily accessible by many corresponding appli- 
cations. A large collection of HRTF databases stored in SOFA, with some of them 
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combined with meshes stored in OBJ, PLY and STL files is available at the SOFA 
website.’ 

When HRTFs are obtained, there is strong demand to evaluate their quality. This 
is especially interesting when comparing the results from numerical HRTF calcula- 
tions. The evaluations can be performed at various levels: geometrical, acoustical 
and perceptive. The evaluation at the geometric level can be done by comparing the 
deviation between two meshes of the pinna and representing the deviation as the 
Hausdorff distance [177]. The evaluation at the acoustic level can be done by 
calculating the spectral distortion 


1 HRTF(x*,f,)\ 
s=) [201 (x o), (7) 


HRTF (x*, f;) 


where HRTF (x*, f;) denotes the calculated and HRTF (x* , f;) the measured 
one, summarised over N discrete frequencies. The evaluation at the perceptual level 
can be simulated by means of auditory modelling [50] or direct performance of 
localisation experiment [50, 90, 147]. Especially the evaluation of localisation errors 
in the median plane can be relevant because the sound localisation in the median 
plane is directly related to the quality of monaural spectral features in the HRTFs 
[46, 178]. A calculated HRTF set yielding similar perceptual results as the natural 
listener’s HRTFs can be described as perceptually valid. 


6. Conclusions 


With a specialised measurement setup, acoustic HRTF measurements can be 
done within a few minutes. Still, such setups are expensive and require the listener 
to sit or stand still for the whole measurement duration. The requirement of 
specialised components has been limiting the popularity of the acoustic methods. 
Recent advances, however, have been made by integrating head-movement track- 
ing in systems to be used at home, especially since the commercialisation of VR 
headsets. These advances provide an easy-to-use measurement setup, but still need 
investigation on how many and which measurement positions are crucial to acquire 
a sufficient measurement grid for perceptually valid HRTFs. 

With the availability of numerical HRTF calculations, the acquisition of 
personalised HRTFs has undergone significant advances. While the acoustic HRTF 
measurement still remains the reference acquisition method, numerical HRTF cal- 
culation paves the road towards personalised HRTFs available for a wide audience. 
The most widely used approaches, FEM, FDTD, BEM and BEM coupled with the 
FMM, when applied under optimal conditions, can yield acoustically and perceptu- 
ally valid results. 

Machine learning and neural networks gain increasing popularity and, in the 
future, may even further push the usability of numerical HRTF calculations. For 
example, neural networks might be able to support the photogrammetric mesh 
acquisition or even estimate the HRTFs directly from listener-specific anthropo- 
metric data such as photographs. Further improvements in terms of efficiency, 
accuracy and precision are still ongoing subject of research. 

Despite the clear definition when it comes to storing an HRTF data set by means 
of SOFA, a similar definition for the description of anthropometric data is still not 


1 https://www.sofaconventions.org/mediawiki/index.php/Files 
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available. This might be rooted in our poor understanding of the importance of 
parts of the pinna and its contribution to the HRTF. Here, a clear goal is to better 
understand the anthropometry and its relation with HRTFs. All this future work 
heads into the direction of expanding the access to personalised HRTFs enabling 
their availability for everyone. 
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